unstructured data
From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When
Large language models (LLMs) like transformers demonstrate impressive in-context learning (ICL) capabilities, allowing them to make predictions for new tasks based on prompt exemplars without parameter updates. While existing ICL theories often assume structured training data resembling ICL tasks (e.g., x-y pairs for linear regression), LLMs are typically trained unsupervised on unstructured text, such as web content, which lacks clear parallels to tasks like word analogy. To address this gap, we examine what enables ICL in models trained on unstructured data, focusing on critical sequence model requirements and training data structure. We find that many ICL capabilities can emerge simply from co-occurrence of semantically related word pairs in unstructured data; word analogy completion, for example, can provably arise purely through co-occurrence modeling, using classical language models like continuous bag of words (CBOW), without needing positional information or attention mechanisms. However, positional information becomes crucial for logic reasoning tasks requiring generalization to unseen tokens. Finally, we identify two cases where ICL fails: one in logic reasoning tasks that require generalizing to new, unseen patterns, and another in analogy completion where relevant word pairs appear only in fixed training positions. These findings suggest that LLMs' ICL abilities depend heavily on the structural elements within their training data.
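As a rough illustration of the co-occurrence claim in this abstract, the sketch below completes a word analogy prompt using nothing but sentence-level co-occurrence counts. It is a count-based stand-in, not CBOW itself and not the paper's construction; the toy corpus, the `complete` helper, and the scoring rule are all assumptions made for this example. The prompt is treated as an unordered bag of words, so no positional information is used.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy corpus: each "sentence" is an unordered bag of words in which
# a related pair (country, capital) co-occurs alongside an unrelated word.
corpus = [
    ["paris", "france", "wine"],
    ["france", "paris", "louvre"],
    ["paris", "france", "seine"],
    ["berlin", "germany", "beer"],
    ["germany", "berlin", "wall"],
    ["berlin", "germany", "spree"],
    ["tokyo", "japan", "sushi"],
    ["japan", "tokyo", "temple"],
    ["tokyo", "japan", "fuji"],
]

# Count how often two distinct words appear in the same sentence.
cooc = Counter()
for sentence in corpus:
    for a, b in permutations(sentence, 2):
        cooc[(a, b)] += 1


def complete(prompt):
    """Return the out-of-prompt word with the highest summed co-occurrence
    count against all prompt words (order of the prompt is ignored)."""
    vocab = {w for s in corpus for w in s}
    candidates = sorted(vocab - set(prompt))
    return max(candidates, key=lambda w: sum(cooc[(c, w)] for c in prompt))


prediction = complete(["france", "paris", "germany", "berlin", "japan"])
print(prediction)
```

The example only works because the related pairs co-occur more often than any distractor word does; with noisier counts a normalized score would be needed, which this sketch does not attempt.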
Deep Multi-Modal Structural Equations For Causal Effect Estimation With Unstructured Proxies
Estimating the effect of intervention from observational data while accounting for confounding variables is a key task in causal inference. Oftentimes, the confounders are unobserved, but we have access to large amounts of additional unstructured data (images, text) that contain valuable proxy signal about the missing confounders. This paper argues that leveraging this unstructured data can greatly improve the accuracy of causal effect estimation. Specifically, we introduce deep multi-modal structural equations, a generative model for causal effect estimation in which confounders are latent variables and unstructured data are proxy variables. This model supports multiple multimodal proxies (images, text) as well as missing data. We empirically demonstrate that our approach outperforms existing methods based on propensity scores and corrects for confounding using unstructured inputs on tasks in genomics and healthcare.
- North America > Canada > Alberta (0.14)
- North America > United States > New York (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Information Technology (0.93)
- Media (0.68)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Tennessee (0.04)
- Europe > France (0.04)
- Europe > Albania > Durrës County > Durrës (0.04)
Cross-Modal Temporal Fusion for Financial Market Forecasting
Pei, Yunhua, Cartlidge, John, Mandal, Anandadeep, Gold, Daniel, Marcilio, Enrique, Mazzon, Riccardo
Accurate forecasting in financial markets requires integrating diverse data sources, from historical prices to macroeconomic indicators and financial news. However, existing models often fail to align these modalities effectively, limiting their practical use. In this paper, we introduce a transformer-based deep learning framework, Cross-Modal Temporal Fusion (CMTF), that fuses structured and unstructured financial data for improved market prediction. The model incorporates a tensor interpretation module for feature selection and an auto-training pipeline for efficient hyperparameter tuning. Experimental results using FTSE 100 stock data demonstrate that CMTF achieves superior performance in price direction classification compared to classical and deep learning baselines. These findings suggest that our framework is an effective and scalable solution for real-world cross-modal financial forecasting tasks.
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.05)
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.05)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
Revealing Multimodal Causality with Large Language Models
Li, Jin, Wang, Shoujin, Zhang, Qi, Liu, Feng, Liu, Tongliang, Cao, Longbing, Yu, Shui, Chen, Fang
Uncovering cause-and-effect mechanisms from data is fundamental to scientific progress. While large language models (LLMs) show promise for enhancing causal discovery (CD) from unstructured data, their application to the increasingly prevalent multimodal setting remains a critical challenge. Even with the advent of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two primary limitations: (1) difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification; and (2) insufficiency to handle structural ambiguities with purely observational data. To address these challenges, we propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors based on the interactions explored from contrastive sample pairs; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes iteratively by incorporating the world knowledge and reasoning capabilities of MLLMs. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data.
- Oceania > Australia (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Spain > Andalusia > Seville Province > Seville (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (0.70)
- Health & Medicine > Diagnostic Medicine > Imaging (0.67)
- Health & Medicine > Therapeutic Area > Oncology > Lung Cancer (0.48)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.45)
A Case for Computing on Unstructured Data
Sadia, Mushtari, Chowdhury, Amrita Roy, Chen, Ang
Unstructured data, such as text, images, audio, and video, comprises the vast majority of the world's information, yet it remains poorly supported by traditional data systems that rely on structured formats for computation. We argue for a new paradigm, which we call computing on unstructured data, built around three stages: extraction of latent structure, transformation of this structure through data processing techniques, and projection back into unstructured formats. This bi-directional pipeline allows unstructured data to benefit from the analytical power of structured computation, while preserving the richness and accessibility of unstructured representations for human and AI consumption. We illustrate this paradigm through two use cases and present the research components that need to be developed in a new data system called MXFlow.
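The three stages named in this abstract (extraction, transformation, projection) can be sketched on a toy example. This is only an illustration under stated assumptions: the sample notes, the regular expression, and the `extract`, `transform`, and `project` functions are hypothetical and are not part of MXFlow, and a regex stands in for whatever learned extractor a real system would use.

```python
import re
from collections import defaultdict

# Hypothetical unstructured input: free-text expense notes.
notes = [
    "Alice paid 30 USD for lunch.",
    "Bob paid 12 USD for coffee.",
    "Alice paid 8 USD for tea.",
]

# Assumed pattern for this toy data only.
PATTERN = re.compile(r"(?P<name>\w+) paid (?P<amount>\d+) USD")


def extract(texts):
    """Stage 1: pull latent structure (name, amount) out of raw text."""
    rows = []
    for text in texts:
        match = PATTERN.search(text)
        if match:
            rows.append({"name": match.group("name"),
                         "amount": int(match.group("amount"))})
    return rows


def transform(rows):
    """Stage 2: ordinary structured computation, here a group-by sum."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["name"]] += row["amount"]
    return dict(totals)


def project(totals):
    """Stage 3: render the structured result back into readable text."""
    return " ".join(f"{name} spent {total} USD in total."
                    for name, total in sorted(totals.items()))


summary = project(transform(extract(notes)))
print(summary)
```

Keeping the stages as separate functions mirrors the bi-directional pipeline described above: the middle stage never touches raw text, so any structured processing could be swapped in without changing the extraction or projection steps.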
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- North America > Canada (0.04)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.68)
- North America > United States > Oregon (0.14)
- North America > Canada > Alberta (0.14)
- Europe > Russia (0.14)
- Research Report > Experimental Study (1.00)
- Workflow (0.66)
- Information Technology (0.93)
- Media (0.68)
- Energy > Oil & Gas > Upstream (0.34)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Tennessee (0.04)
- Europe > France (0.04)
- Europe > Albania > Durrës County > Durrës (0.04)
Personalized Treatment Effect Estimation from Unstructured Data
Arno, Henri, Demeester, Thomas
Existing methods for estimating personalized treatment effects typically rely on structured covariates, limiting their applicability to unstructured data. Yet, leveraging unstructured data for causal inference has considerable application potential, for instance in healthcare, where clinical notes or medical images are abundant. To this end, we first introduce an approximate 'plug-in' method trained directly on the neural representations of unstructured data. However, when these fail to capture all confounding information, the method may be subject to confounding bias. We therefore introduce two theoretically grounded estimators that leverage structured measurements of the confounders during training, but allow estimating personalized treatment effects purely from unstructured inputs, while avoiding confounding bias. When these structured measurements are only available for a non-representative subset of the data, these estimators may suffer from sampling bias. To address this, we further introduce a regression-based correction that accounts for the non-uniform sampling, assuming the sampling mechanism is known or can be well-estimated. Our experiments on two benchmark datasets show that the plug-in method, directly trainable on large unstructured datasets, achieves strong empirical performance across all settings, despite its simplicity.
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Health & Medicine > Health Care Technology > Medical Record (1.00)
- Health & Medicine > Therapeutic Area (0.69)